redescription mining
Hashing for Fast Pattern Set Selection
Karjalainen, Maiju, Miettinen, Pauli
Pattern set mining, the task of finding a good set of patterns instead of all patterns, is a fundamental problem in data mining. Many different definitions of what constitutes a good set have been proposed in recent years. In this paper, we consider the reconstruction error as a proxy measure for the goodness of the set, and concentrate on the adjacent problem of how to find a good set efficiently. We propose a method based on bottom-k hashing for efficiently selecting the set and extend the method to the common case where the patterns might only appear in approximate form in the data. Our approach has applications in tiling databases, Boolean matrix factorization, and redescription mining, among others. We show that our hashing-based approach is significantly faster than the standard greedy algorithm while obtaining almost equally good results on both synthetic and real-world data sets.
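As a rough illustration of the bottom-k hashing primitive mentioned in the abstract above, the following Python sketch builds bottom-k sketches of two sets and estimates their Jaccard similarity from the sketches alone. It is not the authors' selection procedure: the function names (`stable_hash`, `bottom_k_sketch`, `estimate_jaccard`), the choice of BLAKE2b as the hash function, and the toy sets are all assumptions made for this example.

```python
import hashlib


def stable_hash(item, seed=0):
    """Map an item to a 64-bit integer (hypothetical helper; any good hash works)."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def bottom_k_sketch(items, k, seed=0):
    """Keep the k smallest hash values of the set; this is the bottom-k sketch."""
    return sorted({stable_hash(x, seed) for x in items})[:k]


def estimate_jaccard(sketch_a, sketch_b, k):
    """Estimate |A ∩ B| / |A ∪ B| using only the two sketches.

    The k smallest values of the merged sketches form the bottom-k sketch
    of A ∪ B; the fraction of them present in both sketches is the estimate.
    """
    a, b = set(sketch_a), set(sketch_b)
    union_bottom = sorted(a | b)[:k]
    if not union_bottom:
        return 0.0
    shared = sum(1 for v in union_bottom if v in a and v in b)
    return shared / len(union_bottom)


# Toy usage: two overlapping sets of item identifiers (made-up data).
set_a = set(range(0, 1000))
set_b = set(range(500, 1500))
k = 256
estimate = estimate_jaccard(
    bottom_k_sketch(set_a, k), bottom_k_sketch(set_b, k), k
)
true_value = len(set_a & set_b) / len(set_a | set_b)
print(f"estimate={estimate:.3f}, true={true_value:.3f}")
```

When a set has fewer than k elements, its sketch holds every hash value and the estimate coincides with the exact Jaccard similarity (barring hash collisions). For larger sets the sketch trades a small estimation error for a fixed memory footprint of k values per set.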
A redescription mining framework for post-hoc explaining and relating deep learning models
Mihelčić, Matej, Grubišić, Ivan, Keber, Miha
Deep learning models (DLMs) achieve increasingly high performance on both structured and unstructured data. They have significantly extended the applicability of machine learning to various domains. Their success in making predictions, detecting patterns and generating new data has made a significant impact on science and industry. Despite these accomplishments, DLMs are difficult to explain because of their enormous size. In this work, we propose a novel framework for post-hoc explaining and relating DLMs using redescriptions. The framework allows cohort analysis of arbitrary DLMs by identifying statistically significant redescriptions of neuron activations. It allows coupling neurons to a set of target labels or sets of descriptive attributes, relating layers within a single DLM or associating different DLMs. The proposed framework is independent of the artificial neural network architecture and can work with more complex target labels (e.g. multi-label or multi-target scenarios). Additionally, it can emulate both the pedagogical and the decompositional approaches to rule extraction. These properties can increase the explainability and interpretability of arbitrary DLMs by providing information different from that offered by existing explainable-AI approaches.
Fast Redescription Mining Using Locality-Sensitive Hashing
Karjalainen, Maiju, Galbrun, Esther, Miettinen, Pauli
A redescription is a pattern that characterises roughly the same entities in two different ways, and redescription mining is the task of automatically extracting redescriptions from the input dataset, given user-defined constraints. Redescription mining has found applications in various fields of science, such as ecometrics. Ecometrics aims to identify and model the functional relationships between traits of organisms and their environments [5, 7]. For instance, the teeth of large plant-eating mammals are adapted to the food that is available in their environment, which in turn depends on the climatic conditions, potentially allowing one to reason about the climate in the past based on the fossil record. To apply redescription mining in this context, the entities in the dataset represent localities, with two sets of attributes recording respectively the distribution of dental traits among species and the climatic conditions at each locality [11, 19].
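To make the notion of "roughly the same entities" concrete, the Python sketch below evaluates one candidate redescription on a toy dataset of localities. The similarity of the two supports is measured with the Jaccard index, which is the accuracy measure commonly used in redescription mining. The locality names, the attributes `hypsodonty` and `precip_mm`, the thresholds, and all values are invented for illustration and are not taken from the paper.

```python
def support(entities, predicate):
    """Return the set of entity names whose attributes satisfy the query."""
    return {name for name, attrs in entities.items() if predicate(attrs)}


def jaccard(a, b):
    """Jaccard index of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical localities with one dental-trait attribute and one climate attribute.
localities = {
    "loc1": {"hypsodonty": 2.6, "precip_mm": 300},
    "loc2": {"hypsodonty": 2.4, "precip_mm": 450},
    "loc3": {"hypsodonty": 1.2, "precip_mm": 1200},
    "loc4": {"hypsodonty": 2.1, "precip_mm": 800},
    "loc5": {"hypsodonty": 1.0, "precip_mm": 1500},
}

# One query per view: the pair of queries is the candidate redescription.
dental_support = support(localities, lambda a: a["hypsodonty"] >= 2.0)
climate_support = support(localities, lambda a: a["precip_mm"] <= 500)

accuracy = jaccard(dental_support, climate_support)
print(sorted(dental_support), sorted(climate_support), round(accuracy, 3))
```

In this toy data the dental query selects three localities and the climate query two of those three, so the pair scores a Jaccard index of 2/3. A mining algorithm would search over many such query pairs and keep those whose accuracy and support size satisfy the user-defined constraints.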
Multi-view redescription mining using tree-based multi-target prediction models
Mihelčić, Matej, Džeroski, Sašo, Šmuc, Tomislav
The task of redescription mining is concerned with re-describing different subsets of entities contained in a dataset and revealing non-trivial associations between different subsets of attributes, called views. This interesting and challenging task is encountered in different scientific fields, and is addressed by a number of approaches that obtain redescriptions and allow for the exploration and analysis of attribute associations. The main limitation of existing approaches to this task is their inability to use more than two views. Our work alleviates this drawback. We present a memory-efficient, extensible multi-view redescription mining framework that can be used to relate multiple (i.e. more than two) views, which are disjoint sets of attributes describing one set of entities. The framework includes: a) the use of random forests of Predictive Clustering trees, with and without random output selection, and random forests of Extra Predictive Clustering trees, b) the use of Extra Predictive Clustering trees as the main rule generation mechanism in the framework and c) the use of random view subset projections. We provide multiple performance analyses of the proposed framework and demonstrate its usefulness in increasing the understanding of different machine learning models, which has become a topic of growing importance in machine learning and especially in the field of computer science called explainable data science.
Using Redescription Mining to Relate Clinical and Biological Characteristics of Cognitively Impaired and Alzheimer's Disease Patients
Mihelčić, Matej, Šimić, Goran, Leko, Mirjana Babić, Lavrač, Nada, Džeroski, Sašo, Šmuc, Tomislav
We used redescription mining to find interpretable rules revealing associations between determinants that provide insights into Alzheimer's disease (AD). We extended the CLUS-RM redescription mining algorithm to a constraint-based redescription mining (CBRM) setting, which enables several modes of targeted exploration of specific, user-constrained associations. Redescription mining enabled finding specific constructs of clinical and biological attributes that describe many groups of subjects of different size, homogeneity and levels of cognitive impairment. We confirmed some previously known findings. However, in some instances, such as testosterone, the imaging attribute Spatial Pattern of Abnormalities for Recognition of Early AD, and the plasma levels of leptin and angiopoietin-2, we corroborated previously debatable findings or provided additional information about these variables and their association with AD pathogenesis. Applying redescription mining to ADNI data resulted in the discovery of one largely unknown attribute: the Pregnancy-Associated Plasma Protein-A (PAPP-A), which we found to be highly associated with cognitive impairment in AD. Statistically significant correlations (p ≤ 0.01) were found between PAPP-A and various clinical tests. The importance of this finding lies in the fact that PAPP-A is a metalloproteinase known to cleave insulin-like growth factor binding proteins. Since it also shares similar substrates with the A Disintegrin and Metalloproteinase (ADAM) family of enzymes, which act as α-secretase to physiologically cleave amyloid precursor protein (APP) in the non-amyloidogenic pathway, it could be directly involved in the metabolism of APP very early during the disease course. Therefore, further studies should investigate the role of PAPP-A in the development of AD more thoroughly.